1 Introduction

1.1 Title of Project

Life Expectancy Factors and Prediction

1.2 Source of dataset

Our study will focus on various factors affecting life expectancy considering demographic variables, income composition and mortality rates. Life expectancy dataset made available to public for the purpose of health data analysis. This dataset is related to life expectancy, health factors from year 2000-2015 for 193 countries has been collected from comes Global Health Observatory (GHO) data repository under World Health Organization. The dataset contains 2938 records and 22 columns. For our project dataset is sourced from Kaggle: Kaggle Dataset Source Link

1.3 Statement of personal interest

  • Research: Our team like to apply the knowledge gained in STAT420 course to understand how data analysis techniques and linear regression model can guide us to build best possible model and predict within a reasonable amount of confidence the life expectancy of different countries around the world. This will eventually help government officials adopt the necessary steps to help increase the life expectancy.

  • Inspiration and Personal interest: We would like to explore how statistics combined with technology can make difference in real-world use cases impacting every individual, provides guidelines to government policies and encourage them take informed decisions in time. We are planning to use different predictors like BMI, alcohol intake, HIV/aids and many more. Our aim is through statistics to help determining the predicting factors that are contributing to lower value of life expectancy. Explainable model clearly identify predictors contributing to life expectancy and by improving certain predictors (for example improving the polio vaccine intake or reducing alcohol) life expectancy can be improved.

1.4 Description of the dataset

In the following dataset, Life expectancy is the response continuous variable, Country and Status are the categorical explanatory variables and rest are either continuous numeric or numeric explanatory variables.

Table 1.1: Data Description
Variables Type Description
Country character Country
Year numerical Year
Status character Developed or Developing status
Life expectancy numerical Life Expectancy in age
Adult Mortality numerical Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
Infant deaths numerical Number of Infant Deaths per 1000 population
Alcohol numerical Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
Percentage expenditure numerical Expenditure on health as a percentage of Gross Domestic Product per capita(%)
Hepatitis B numerical Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
Measles numerical Measles - number of reported cases per 1000 population
BMI numerical Average Body Mass Index of entire population
Under-five deaths numerical Number of under-five deaths per 1000 population
Polio numerical Polio (Pol3) immunization coverage among 1-year-olds (%)
Total expenditure numerical General government expenditure on health as a percentage of total government expenditure (%)
Diphtheria numerical Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
HIV/AIDS numerical Deaths per 1 000 live births HIV/AIDS (0-4 years)
GDP numerical Gross Domestic Product per capita (in USD)
Population numerical Population of the country
Thinness 1-19 yrs numerical Prevalence of thinness among children and adolescents for Age 10 to 19 (% )
Thinness 5-9 yrs numerical Prevalence of thinness among children for Age 5 to 9(%)
Income comp. of resources numerical Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
Schooling numerical Number of years of Schooling(years)

1.5 Goal of this model

The goal of the model is to identify what are the key contributing factors towards life expectancy and to accurately predict the life expectancy.

2 Method

2.1 Data Cleaning

2.1.1 Investigations - Visually looking at data statistics

Now we investigate the dataset to see what actions we need to take with the dataset before creating a model

The data source format is csv and can be read into R, below are the sample records and variable data types. Since the output is very long, we use echo=FALSE, but as a sample, we show the output below for the first few rows

Table 2.1: Data Structure
Country Year Status Life.expectancy Adult.Mortality infant.deaths Alcohol percentage.expenditure Hepatitis.B Measles BMI under.five.deaths Polio Total.expenditure Diphtheria HIV.AIDS GDP Population thinness..1.19.years thinness.5.9.years Income.composition.of.resources Schooling
Afghanistan 2015 Developing 65.0 263 62 0.01 71.27962 65 1154 19.1 83 6 8.16 65 0.1 584.2592 33736494 17.2 17.3 0.479 10.1
Afghanistan 2014 Developing 59.9 271 64 0.01 73.52358 62 492 18.6 86 58 8.18 62 0.1 612.6965 327582 17.5 17.5 0.476 10.0
Afghanistan 2013 Developing 59.9 268 66 0.01 73.21924 64 430 18.1 89 62 8.13 64 0.1 631.7450 31731688 17.7 17.7 0.470 9.9
Afghanistan 2012 Developing 59.5 272 69 0.01 78.18421 67 2787 17.6 93 67 8.52 67 0.1 669.9590 3696958 17.9 18.0 0.463 9.8
Afghanistan 2011 Developing 59.2 275 71 0.01 7.09711 68 3013 17.2 97 68 7.87 68 0.1 63.5372 2978599 18.2 18.2 0.454 9.5
Afghanistan 2010 Developing 58.8 279 74 0.01 79.67937 66 1989 16.7 102 66 9.20 66 0.1 553.3289 2883167 18.4 18.4 0.448 9.2

2.1.2 Data Cleanup based on data inspection

Rename these columns

Most of the columns are not one word so we will rename them to one word for ease of programming

names(life_data)[c(4,5,6,8,9,12,14,16,19,20,21)] =  c("Life_expectancy","Adult_Mortality","Infant_deaths",
                           "Percentage_expenditure","Hepatitis_B","Under_5_deaths",
                           "Total_expenditure","HIV_AIDS","Thinness_1_to_19_yrs","Thinness_5_to_9_yrs",
                           "Income_composition_of_resources")

Remove these columns

  • Remove Country column since it is not a predictor and is simply a country name
life_data = subset(life_data, select = -c(`Country`))

Other Observations

Status column is a factor variable with two levels - Developed and Developing

is.factor(life_data$Status)
## [1] FALSE
life_data$Status = as.factor(life_data$Status)
levels(life_data$Status)
## [1] "Developed"  "Developing"
# Box plot
boxplot(Life_expectancy ~ Status, data = life_data, 
     xlab = "Status",
     ylab = "Life expectancy",
     main = "Life expectancy vs Status",
     pch = 20,
     cex = 2,
     col = "darkorange",
     border = "dodgerblue"
     )

We can observe the following two points from the box plot * Mean of Life Expectancy of a developed nation is higher than the mean of a developing nation * Variance in Life Expectancy of a developing nation is higher than that of a developed nation

There is some missing data. We will do a quick summary on how many values is each variable missing

v = sort(colSums(is.na(life_data)),decreasing = TRUE)
knitr::kable(v, col.names = c("NA Count"),align = "ll", caption = "Variables with missing value count")
Table 2.2: Variables with missing value count
NA Count
Population 652
Hepatitis_B 553
GDP 448
Total_expenditure 226
Alcohol 194
Income_composition_of_resources 167
Schooling 163
BMI 34
Thinness_1_to_19_yrs 34
Thinness_5_to_9_yrs 34
Polio 19
Diphtheria 19
Life_expectancy 10
Adult_Mortality 10
Year 0
Status 0
Infant_deaths 0
Percentage_expenditure 0
Measles 0
Under_5_deaths 0
HIV_AIDS 0
  • Population - 652 records have missing values.
  • Hepatitis.B - 553 records have missing values.
  • GDP - 448 records have missing values.
  • Total Expenditure - 226 records have missing values.
  • Alcohol - 194 records have missing values.
  • Income composition of resources - 167 records have missing values.
  • Schooling - 163 records have missing values.
  • BMI - 34 records have missing values.
  • Thinness 1.19 years - 34 records have missing values.
  • Thinness 5.9 years - 34 records have missing values.
  • Polio - 19 records have missing values.
  • Diphtheria - 19 records have missing values.
  • Adult Mortality - 10 records have missing values.
  • Life expectancy - 10 records have missing values.
#remove missing observations
life_data = na.omit(life_data)

2.1.3 Investigation - Correlation

Now, to identify columns that have high correlations. First we plot the correlation matrix

corrplot(cor(life_data[,3:21]))

We define high correlation as great than 70% correlation.

In order to identify correlation we first create a dataframe that has only numeric predictors

life_data_with_char_cols_removed = subset(life_data, select = -c(`Status`))

Now we identify the correlation in numeric predictors. We mark all correlations less than .70 as NA, so that we can easily identify highly correlated features

For brevity of the report, we only show the head of the correlation matrix, while we had looked at the entire matrix to come to our conclusions below

cor_relation = cor(life_data_with_char_cols_removed, use = "complete.obs")

cor_relation[abs(cor_relation) < 0.80] <- NA
head(cor_relation,20)
##                                 Year Life_expectancy Adult_Mortality
## Year                               1              NA              NA
## Life_expectancy                   NA               1              NA
## Adult_Mortality                   NA              NA               1
## Infant_deaths                     NA              NA              NA
## Alcohol                           NA              NA              NA
## Percentage_expenditure            NA              NA              NA
## Hepatitis_B                       NA              NA              NA
## Measles                           NA              NA              NA
## BMI                               NA              NA              NA
## Under_5_deaths                    NA              NA              NA
## Polio                             NA              NA              NA
## Total_expenditure                 NA              NA              NA
## Diphtheria                        NA              NA              NA
## HIV_AIDS                          NA              NA              NA
## GDP                               NA              NA              NA
## Population                        NA              NA              NA
## Thinness_1_to_19_yrs              NA              NA              NA
## Thinness_5_to_9_yrs               NA              NA              NA
## Income_composition_of_resources   NA              NA              NA
## Schooling                         NA              NA              NA
##                                 Infant_deaths Alcohol Percentage_expenditure
## Year                                       NA      NA                     NA
## Life_expectancy                            NA      NA                     NA
## Adult_Mortality                            NA      NA                     NA
## Infant_deaths                        1.000000      NA                     NA
## Alcohol                                    NA       1                     NA
## Percentage_expenditure                     NA      NA               1.000000
## Hepatitis_B                                NA      NA                     NA
## Measles                                    NA      NA                     NA
## BMI                                        NA      NA                     NA
## Under_5_deaths                       0.996906      NA                     NA
## Polio                                      NA      NA                     NA
## Total_expenditure                          NA      NA                     NA
## Diphtheria                                 NA      NA                     NA
## HIV_AIDS                                   NA      NA                     NA
## GDP                                        NA      NA               0.959299
## Population                                 NA      NA                     NA
## Thinness_1_to_19_yrs                       NA      NA                     NA
## Thinness_5_to_9_yrs                        NA      NA                     NA
## Income_composition_of_resources            NA      NA                     NA
## Schooling                                  NA      NA                     NA
##                                 Hepatitis_B Measles BMI Under_5_deaths Polio
## Year                                     NA      NA  NA             NA    NA
## Life_expectancy                          NA      NA  NA             NA    NA
## Adult_Mortality                          NA      NA  NA             NA    NA
## Infant_deaths                            NA      NA  NA       0.996906    NA
## Alcohol                                  NA      NA  NA             NA    NA
## Percentage_expenditure                   NA      NA  NA             NA    NA
## Hepatitis_B                               1      NA  NA             NA    NA
## Measles                                  NA       1  NA             NA    NA
## BMI                                      NA      NA   1             NA    NA
## Under_5_deaths                           NA      NA  NA       1.000000    NA
## Polio                                    NA      NA  NA             NA     1
## Total_expenditure                        NA      NA  NA             NA    NA
## Diphtheria                               NA      NA  NA             NA    NA
## HIV_AIDS                                 NA      NA  NA             NA    NA
## GDP                                      NA      NA  NA             NA    NA
## Population                               NA      NA  NA             NA    NA
## Thinness_1_to_19_yrs                     NA      NA  NA             NA    NA
## Thinness_5_to_9_yrs                      NA      NA  NA             NA    NA
## Income_composition_of_resources          NA      NA  NA             NA    NA
## Schooling                                NA      NA  NA             NA    NA
##                                 Total_expenditure Diphtheria HIV_AIDS      GDP
## Year                                           NA         NA       NA       NA
## Life_expectancy                                NA         NA       NA       NA
## Adult_Mortality                                NA         NA       NA       NA
## Infant_deaths                                  NA         NA       NA       NA
## Alcohol                                        NA         NA       NA       NA
## Percentage_expenditure                         NA         NA       NA 0.959299
## Hepatitis_B                                    NA         NA       NA       NA
## Measles                                        NA         NA       NA       NA
## BMI                                            NA         NA       NA       NA
## Under_5_deaths                                 NA         NA       NA       NA
## Polio                                          NA         NA       NA       NA
## Total_expenditure                               1         NA       NA       NA
## Diphtheria                                     NA          1       NA       NA
## HIV_AIDS                                       NA         NA        1       NA
## GDP                                            NA         NA       NA 1.000000
## Population                                     NA         NA       NA       NA
## Thinness_1_to_19_yrs                           NA         NA       NA       NA
## Thinness_5_to_9_yrs                            NA         NA       NA       NA
## Income_composition_of_resources                NA         NA       NA       NA
## Schooling                                      NA         NA       NA       NA
##                                 Population Thinness_1_to_19_yrs
## Year                                    NA                   NA
## Life_expectancy                         NA                   NA
## Adult_Mortality                         NA                   NA
## Infant_deaths                           NA                   NA
## Alcohol                                 NA                   NA
## Percentage_expenditure                  NA                   NA
## Hepatitis_B                             NA                   NA
## Measles                                 NA                   NA
## BMI                                     NA                   NA
## Under_5_deaths                          NA                   NA
## Polio                                   NA                   NA
## Total_expenditure                       NA                   NA
## Diphtheria                              NA                   NA
## HIV_AIDS                                NA                   NA
## GDP                                     NA                   NA
## Population                               1                   NA
## Thinness_1_to_19_yrs                    NA             1.000000
## Thinness_5_to_9_yrs                     NA             0.927913
## Income_composition_of_resources         NA                   NA
## Schooling                               NA                   NA
##                                 Thinness_5_to_9_yrs
## Year                                             NA
## Life_expectancy                                  NA
## Adult_Mortality                                  NA
## Infant_deaths                                    NA
## Alcohol                                          NA
## Percentage_expenditure                           NA
## Hepatitis_B                                      NA
## Measles                                          NA
## BMI                                              NA
## Under_5_deaths                                   NA
## Polio                                            NA
## Total_expenditure                                NA
## Diphtheria                                       NA
## HIV_AIDS                                         NA
## GDP                                              NA
## Population                                       NA
## Thinness_1_to_19_yrs                       0.927913
## Thinness_5_to_9_yrs                        1.000000
## Income_composition_of_resources                  NA
## Schooling                                        NA
##                                 Income_composition_of_resources Schooling
## Year                                                         NA        NA
## Life_expectancy                                              NA        NA
## Adult_Mortality                                              NA        NA
## Infant_deaths                                                NA        NA
## Alcohol                                                      NA        NA
## Percentage_expenditure                                       NA        NA
## Hepatitis_B                                                  NA        NA
## Measles                                                      NA        NA
## BMI                                                          NA        NA
## Under_5_deaths                                               NA        NA
## Polio                                                        NA        NA
## Total_expenditure                                            NA        NA
## Diphtheria                                                   NA        NA
## HIV_AIDS                                                     NA        NA
## GDP                                                          NA        NA
## Population                                                   NA        NA
## Thinness_1_to_19_yrs                                         NA        NA
## Thinness_5_to_9_yrs                                          NA        NA
## Income_composition_of_resources                               1        NA
## Schooling                                                    NA         1

2.1.4 Data Cleanup based on correlation

Based on the above correlation matrix and corroborated by the correlation plot , we make the below observations

Remove the following predictors

  • Infant_deaths & Under_5_deaths are highly correlated . We remove infant_deaths
  • Percentage_expenditure & GDP are highly correlated. Give GDP also had a lot of missing values we remove Hence we remove GDP
  • Thinness_1_to_19_yrs & Thinness_5_to_9_yrs are highly correlated. Hence we remove Thinness_5_to_9_yrs
life_data_clean = subset(life_data, select = -c(`Infant_deaths`, `GDP`, `Thinness_5_to_9_yrs`))

dim(life_data_clean)
## [1] 1649   18

2.1.5 Data Cleanup to handle missing values

We know that step wont work with missing values. Hence we will need to remove those.

We started with 1649 observations and 21 columns and after our initial cleanup we end up with 1649 observations and 18 columns

2.2 Transformation identification

2.2.1 Pairs plot

Now, before we begin modeling, we look at the pairs plots to see if any of the parameters are an obvious choice for transformations

life_data_with_char_cols_removed = subset(life_data_clean, select = -c(`Status`))
pairs(life_data_with_char_cols_removed , col = "dodgerblue")

We make the following observations from the plot

  • Life Expectancy have strong negative correlation with Adult Mortality and HIV AIDS.
  • Life Expextancy has positive correlation with Income comp. of resources and Schooling.
  • Slightly positive correlation between Life Expectancy and BMI.
  • Slightly positive correlation between Polio and Diphtheria.

Potential for transformations

From pair plot we definitely see need for some transformation specially with Adult Mortality and Income comp. of resources. We will get exact tranformation format with boxcox procedure.

Diagnostics Function Diagnostics function for Fitted versus Residuals & Normal Q-Q plots and testing model against Shapiro-Wilk & Breusch-Pagan test

diagnostics = function(model, pcol = "grey", lcol = "dodgerblue", alpha = 0.01,
                        plotit = TRUE, testit = TRUE) {
  if(plotit == TRUE) {
    par(mfrow=c(1,2)) 
    plot(fitted(model), resid(model), col = pcol, pch = 20,
     xlab = "Fitted", ylab = "Residuals.",
     main = "Fitted versus Residuals")
    abline(h = 0, col = lcol, lwd = 2)
    
    qqnorm(resid(model), main = "Normal Q-Q Plot", col = pcol)
    qqline(resid(model), col = lcol, lwd = 2)
  }
  if (testit == TRUE) {
    normality_p_val = shapiro.test(resid(model))$p.value
    normality_decision = ifelse(normality_p_val < alpha, "Reject", "Fail to Reject.")
    (bptest_pvalue =  bptest(model)$"p.value"[[1]])
    const_variance_decision = ifelse(bptest_pvalue < alpha, "Reject", "Fail to Reject.")
    rmse = round(sqrt(mean(resid(model) ^ 2)), 4)
    aic = round(extractAIC(model)[2], 2)
    num_predictors = length(coef(model)) - 1

    res = list(num_predictors = num_predictors, sw_pvalue = normality_p_val, sw_decision = normality_decision,
               bp_pvalue = bptest_pvalue, bp_decision = const_variance_decision, RMSE = rmse,
               AdjustedR2 = round(summary(model)$"adj.r.squared", 5), AIC=aic)
  
  return(res)
  }
  
  
}

In order to validate that transformations are necessary we will start with a simple additive model and look at its diagnostics plots

model_1 = lm( Life_expectancy  ~ . , data=life_data_clean)
knitr::kable(data.frame(diagnostics(model_1)))

num_predictors sw_pvalue sw_decision bp_pvalue bp_decision RMSE AdjustedR2 AIC
17 0 Reject 0 Reject 3.6097 0.82976 4269.41

Based on the diagnostics and the plots we see that some kind of transformation for the response is necessary.

2.2.2 Boxcox lambda identifications for response and predictors

2.2.2.1 Response transformation identification

In order to figure out the transformation for the response, we find the lambda for it

bc = boxcox(model_1)

We know that the most common Box-Cox Transformations are

\(\lambda\) Transformed Data
-2 \(y^{-2}\)
-1 \(y^{-1}\)
-.5 \(1 \over \sqrt y\)
0 ln(y)
.5 \(\sqrt y\)
1 y
2 \(y^2\)

since our \(\lambda\) is close to 2 we will do \(y^2\) transformations

We redo the model and look at the diagnostics plots again

Apply transformation and test model

#fit new linear regression model using the Box-Cox transformation
new_model = lm(Life_expectancy ^ 2 ~ ., data=life_data_clean)
knitr::kable(data.frame(diagnostics(new_model)))

num_predictors sw_pvalue sw_decision bp_pvalue bp_decision RMSE AdjustedR2 AIC
17 3e-06 Reject 0 Reject 495.775 0.81997 20503.8

We see that the plots and the diagnostics are a lot better, but there seems to be some scope for improvement.

2.2.2.2 Predictors transformation identification

Let us now identify the lambda transformations for the other columns we identified and using those variables as response, fit the model, but keep Life_expectancy ^ 2 in the predictor with others

m3 = lm(Adult_Mortality ~ . - Life_expectancy +  Life_expectancy ^ 2, data = life_data_clean)
boxcox(m3,xlab = "lambda for Adult_Mortality")

We should apply \(\sqrt y\) transformation to Adult_Mortality since \(\lambda\) is close to 0.5

m3 = lm(Alcohol ~ . - Life_expectancy +  Life_expectancy ^ 2 , data = life_data_clean)
boxcox(m3,xlab = "lambda for Alcohol")

We should apply \(\sqrt y\) transformation to Alcohol since \(\lambda\) is close to 0.5

m3 = lm(Hepatitis_B ~ . - Life_expectancy +  Life_expectancy ^ 2 , data = life_data_clean)
boxcox(m3,xlab = "lambda for Hepatitis_B")

We should apply \(y^2\) transformation to Hepatitis_B since \(\lambda\) is close to 2

m3 = lm(HIV_AIDS ~ . - Life_expectancy +  Life_expectancy ^ 2 , data = life_data_clean)
boxcox(m3,xlab = "lambda for HIV_AIDS")

We should apply 1/\(\sqrt HIV AIDS\) transformation to HIV_AIDS since \(\lambda\) is close to -0.5

m3 = lm(Diphtheria ~ . - Life_expectancy +  Life_expectancy ^ 2  , data = life_data_clean)
boxcox(m3,xlab = "lambda for Diphtheria")

We should apply \(y^2\) transformation to Diphtheria since \(\lambda\) is close to 2

m3 = lm(Total_expenditure ~ . - Life_expectancy +  Life_expectancy ^ 2  , data = life_data_clean)
boxcox(m3,xlab = "lambda for Total_expenditure")

We should apply no transformation to Total_expenditure since \(\lambda\) is close to 1.0

m3 = lm(Population ~ . - Life_expectancy +  Life_expectancy ^ 2  , data = life_data_clean)
boxcox(m3,xlab = "lambda for Population")

We should apply log transformation to Population since \(\lambda\) is close to 0

m3 = lm(Polio ~ . - Life_expectancy +  Life_expectancy ^ 2  , data = life_data_clean)
boxcox(m3,xlab = "lambda for Polio")

We should apply \(y^2\) transformation to Polio since \(\lambda\) is close to 2

m3 = lm(Thinness_1_to_19_yrs ~ . - Life_expectancy +  Life_expectancy ^ 2  , data = life_data_clean)
boxcox(m3,xlab = "lambda Thinness_1_to_19_yrs ")

We should apply log transformation to Thinness_1_to_19_yrs since \(\lambda\) is close to 0

m3 = lm(BMI ~ . - Life_expectancy +  Life_expectancy ^ 2 , data = life_data_clean)
boxcox(m3,xlab = "lambda for BMI")

No transformation needed for BMI since \(\lambda\) is close to 1

2.3 Model Identification

2.3.1 Models - Simple Additive, steps with AIC & BIC and transformation

Based on the above analysis we create the below models to start with

  1. A simple additive model
  2. An exhaustive search with backward and forward AIC & BIC
  3. A model with the above transformations

2.3.1.1 Additive

model_additive = lm (Life_expectancy  ~ ., data = life_data_clean)

2.3.1.3 AIC

le_model_aic = n * log(all_models$rss/n) + 2 * (2:p)
(best_aic_ind = which.min(le_model_aic))
## [1] 8
all_models$which[best_aic_ind,]
##                     (Intercept)                            Year 
##                            TRUE                            TRUE 
##                StatusDeveloping                 Adult_Mortality 
##                           FALSE                            TRUE 
##                         Alcohol          Percentage_expenditure 
##                           FALSE                            TRUE 
##                     Hepatitis_B                         Measles 
##                           FALSE                           FALSE 
##                             BMI                  Under_5_deaths 
##                            TRUE                           FALSE 
##                           Polio               Total_expenditure 
##                           FALSE                           FALSE 
##                      Diphtheria                        HIV_AIDS 
##                            TRUE                            TRUE 
##                      Population            Thinness_1_to_19_yrs 
##                           FALSE                           FALSE 
## Income_composition_of_resources                       Schooling 
##                            TRUE                            TRUE
#store to use for later
le_model_best_aic = lm (Life_expectancy ~ Year +
                        + Adult_Mortality + Percentage_expenditure + BMI + Diphtheria + HIV_AIDS+ 
                        + Schooling + Income_composition_of_resources, data = life_data_clean)
# model complexity
plot( le_model_aic ~ I(2:p), 
     xlab = "p = Number of parameters",
     ylab = "AIC",
     main = "AIC vs Model complexity",
     pch = 20,
     cex = 2,
     col = "darkorange",
     type = "b"
     )

2.3.1.3.1 Backward AIC
#backward aic
le_model_back_aic = step(model_additive, trace = 0)
coef(le_model_back_aic)
##                     (Intercept)                            Year 
##                     3.44567e+02                    -1.45521e-01 
##                StatusDeveloping                 Adult_Mortality 
##                    -8.94369e-01                    -1.72102e-02 
##                         Alcohol          Percentage_expenditure 
##                    -1.73065e-01                     4.49088e-04 
##                         Measles                             BMI 
##                     1.55660e-05                     3.57100e-02 
##                  Under_5_deaths               Total_expenditure 
##                    -3.11513e-03                     8.97790e-02 
##                      Diphtheria                        HIV_AIDS 
##                     2.05685e-02                    -4.52247e-01 
##                      Population Income_composition_of_resources 
##                     2.89972e-09                     1.11996e+01 
##                       Schooling 
##                     9.32685e-01
2.3.1.3.2 Forward AIC
#forward aic

le_model_fwd_aic_start = lm (Life_expectancy ~ 1, data = life_data_clean)
le_model_fwd_aic = step(le_model_fwd_aic_start,
                        scope =Life_expectancy ~ 
                        Year
                        + Status
                        + Income_composition_of_resources
                        + Adult_Mortality
                        + Total_expenditure
                        + Alcohol
                        + Percentage_expenditure
                        + Hepatitis_B
                        + Measles
                        + BMI
                        + Under_5_deaths 
                        + Polio
                        + HIV_AIDS
                        + Diphtheria
                        + Thinness_1_to_19_yrs
                        + Schooling
                        ,
                        direction = "forward", trace = 0)

2.3.1.4 BIC

le_model_bic = n * log(all_models$rss/n) + log(n) * (2:p)
(best_bic_ind = which.min(le_model_bic))
## [1] 8
all_models$which[best_bic_ind,]
##                     (Intercept)                            Year 
##                            TRUE                            TRUE 
##                StatusDeveloping                 Adult_Mortality 
##                           FALSE                            TRUE 
##                         Alcohol          Percentage_expenditure 
##                           FALSE                            TRUE 
##                     Hepatitis_B                         Measles 
##                           FALSE                           FALSE 
##                             BMI                  Under_5_deaths 
##                            TRUE                           FALSE 
##                           Polio               Total_expenditure 
##                           FALSE                           FALSE 
##                      Diphtheria                        HIV_AIDS 
##                            TRUE                            TRUE 
##                      Population            Thinness_1_to_19_yrs 
##                           FALSE                           FALSE 
## Income_composition_of_resources                       Schooling 
##                            TRUE                            TRUE
#store to use for later

le_model_best_bic = lm (Life_expectancy ~ Year +
                        + Adult_Mortality + Percentage_expenditure + Schooling + BMI + HIV_AIDS+ 
                        + Diphtheria + Income_composition_of_resources, data = life_data_clean)
2.3.1.4.1 Backward BIC
#backward bic

le_model_back_bic = step(model_additive, trace = 0, k = log(n))
coef(le_model_back_bic)
##                     (Intercept)                            Year 
##                   345.452945985                    -0.146367347 
##                 Adult_Mortality                         Alcohol 
##                    -0.017556785                    -0.137593048 
##          Percentage_expenditure                             BMI 
##                     0.000495584                     0.036261326 
##                  Under_5_deaths                      Diphtheria 
##                    -0.001883351                     0.021854602 
##                        HIV_AIDS Income_composition_of_resources 
##                    -0.449795341                    11.184397773 
##                       Schooling 
##                     0.961446467
2.3.1.4.2 Forward BIC
#forward bic

le_model_fwd_bic_start = lm (Life_expectancy ~ 1, data = life_data_clean)
le_model_fwd_bic = step(le_model_fwd_bic_start,
                        scope =Life_expectancy ~ 
                        Year
                       + Status
                        + Income_composition_of_resources
                        + Adult_Mortality
                        + Total_expenditure
                        + Alcohol
                        + Percentage_expenditure
                        + Hepatitis_B
                        + Measles
                        + BMI
                        + Under_5_deaths 
                        + Polio
                        + HIV_AIDS
                        + Diphtheria
                        + Thinness_1_to_19_yrs
                        ,
                        direction = "forward", k = log(n), trace = 0)
2.3.1.4.3 Transformation
#model_additive_backward_aic = step (model_additive, trace = 0)
model_transform =  lm (Life_expectancy ^ 2  ~ Status + sqrt(Alcohol)
                        + sqrt(Adult_Mortality) + Income_composition_of_resources ^ 2 + 1/sqrt(Total_expenditure) + log(Population)
                        + BMI + 1/sqrt(HIV_AIDS) + Diphtheria ^ 2 + Hepatitis_B ^ 2 + Measles + Percentage_expenditure
                        + Polio ^2 + log(Thinness_1_to_19_yrs),
                 data = life_data_clean)

model_transform_backward_aic =  step (model_transform, trace = 0)
2.3.1.4.4 Polynomial

Based on playing around with the data and trying various different interaction between variables we came up with following interactive polynomial transformation model.

model_poly = lm(Life_expectancy~sqrt(Adult_Mortality)*(Income_composition_of_resources^2)*Percentage_expenditure*Schooling*BMI+HIV_AIDS+Diphtheria+Polio+Year*Status,data=life_data_clean)

To get the rmse of the transformed model we need to get the fitted values back to the original scale

rmse_transformed = sqrt(mean((life_data_clean$Life_expectancy - sqrt(fitted(model_transform)))^2))
rmse_transform_backward_aic = sqrt(mean((life_data_clean$Life_expectancy - sqrt(fitted(model_transform_backward_aic)))^2))
df_result = rbind(m_additive = diagnostics(model_additive,plotit = FALSE), 
                   #m_additive_step = diagnostics(model_additive_backward_aic,plotit = FALSE), 
                   m_transform_1 = diagnostics(model_transform,plotit = FALSE), 
                   m_transform_1_step = diagnostics(model_transform,plotit = FALSE),
                   model_min_aic = diagnostics(le_model_best_aic,plotit = FALSE),
                   model_min_bic = diagnostics(le_model_best_bic,plotit = FALSE),
                   model_back_aic = diagnostics(le_model_back_aic,plotit = FALSE),
                   model_back_bic = diagnostics(le_model_back_bic,plotit = FALSE),
                   model_fwd_aic = diagnostics(le_model_fwd_aic,plotit = FALSE),
                   model_fwd_bic = diagnostics(le_model_fwd_bic,plotit = FALSE),
                   model_poly = diagnostics(model_poly,plotit = FALSE)
                    
                  )

#update the rmse of the transformed models

df_result['m_transform_1',]$RMSE = round(rmse_transformed,4)
df_result['m_transform_1_step',]$RMSE = round(rmse_transform_backward_aic,4)

knitr::kable(df_result, align = "cccccccc", caption = "Model Comparison Summary")
Table 2.3: Model Comparison Summary
num_predictors sw_pvalue sw_decision bp_pvalue bp_decision RMSE AdjustedR2 AIC
m_additive 17 5.80987998863635e-08 Reject 6.41579265092972e-33 Reject 3.6097 0.82976 4269.41
m_transform_1 12 5.58980035080847e-15 Reject 3.40055212895347e-86 Reject 4.8562 0.71269 21269.62
m_transform_1_step 12 5.58980035080847e-15 Reject 3.40055212895347e-86 Reject 4.8573 0.71269 21269.62
model_min_aic 8 6.15431590889256e-10 Reject 3.4663551604506e-29 Reject 3.6678 0.8252 4304.07
model_min_bic 8 6.15431590889256e-10 Reject 3.46635516045084e-29 Reject 3.6678 0.8252 4304.07
model_back_aic 14 4.22813740389515e-08 Reject 1.3205694135825e-34 Reject 3.6138 0.82969 4267.14
model_back_bic 10 2.23394873540412e-08 Reject 4.56792467310997e-35 Reject 3.6319 0.8284 4275.64
model_fwd_aic 13 2.64491669882093e-08 Reject 1.37799693326893e-34 Reject 3.617 0.82949 4268.02
model_fwd_bic 10 5.30301565996376e-13 Reject 1.60142326938387e-81 Reject 3.8867 0.80348 4499.21
model_poly 37 6.78173115056804e-08 Reject 6.79157135486358e-17 Reject 3.5305 0.83513 4236.26

Looking at the above table, we can clearly see that the polynomial model is yielding a good model. The RMSE is the lowest but it has does not have the lowest predictors. Also its adj R2 is the highest

2.4 Model selection

2.4.1 Anova tests

We now perform Anova tests

First test between le_model_back_aic and model_poly

anova(le_model_back_aic, model_poly)
## Analysis of Variance Table
## 
## Model 1: Life_expectancy ~ Year + Status + Adult_Mortality + Alcohol + 
##     Percentage_expenditure + Measles + BMI + Under_5_deaths + 
##     Total_expenditure + Diphtheria + HIV_AIDS + Population + 
##     Income_composition_of_resources + Schooling
## Model 2: Life_expectancy ~ sqrt(Adult_Mortality) * (Income_composition_of_resources^2) * 
##     Percentage_expenditure * Schooling * BMI + HIV_AIDS + Diphtheria + 
##     Polio + Year * Status
##   Res.Df   RSS Df Sum of Sq     F   Pr(>F)    
## 1   1634 21535                                
## 2   1611 20554 23       981 3.343 1.62e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Second between two best models model_additive and model_poly

anova(model_additive, model_poly)
## Analysis of Variance Table
## 
## Model 1: Life_expectancy ~ Year + Status + Adult_Mortality + Alcohol + 
##     Percentage_expenditure + Hepatitis_B + Measles + BMI + Under_5_deaths + 
##     Polio + Total_expenditure + Diphtheria + HIV_AIDS + Population + 
##     Thinness_1_to_19_yrs + Income_composition_of_resources + 
##     Schooling
## Model 2: Life_expectancy ~ sqrt(Adult_Mortality) * (Income_composition_of_resources^2) * 
##     Percentage_expenditure * Schooling * BMI + HIV_AIDS + Diphtheria + 
##     Polio + Year * Status
##   Res.Df   RSS Df Sum of Sq     F   Pr(>F)    
## 1   1631 21487                                
## 2   1611 20554 20     932.3 3.654 9.02e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From both the anova tests its apparent that larger model `model_poly’ is statistically preferred.

2.4.2 Model Evaluation

In this section we’ll evaluate prediction abilities of the models. First let’s split the data set to training and “test”. 75% for training and 25% for test datasets

trn_idx = sample(nrow(life_data_clean), round(nrow(life_data_clean) * 0.75))
life_data_clean_trn = life_data_clean[trn_idx, ]
life_data_clean_tst = life_data_clean[-trn_idx, ]

Train several found models:

model_additive_pred = lm (Life_expectancy  ~ ., data = life_data_clean_trn)
model_transform_pred =  lm (Life_expectancy ^ 2  ~ Status + sqrt(Alcohol)
                        + sqrt(Adult_Mortality) + Income_composition_of_resources ^ 2 + 1/sqrt(Total_expenditure) + log(Population)
                        + BMI + 1/sqrt(HIV_AIDS) + Diphtheria ^ 2 + Hepatitis_B ^ 2 + Measles + Percentage_expenditure
                        + Polio ^2 + log(Thinness_1_to_19_yrs),
                 data = life_data_clean_trn)
model_best_aic_pred = lm (Life_expectancy ~ Year +
                        + Adult_Mortality + Percentage_expenditure + BMI + Diphtheria + HIV_AIDS+ 
                        + Schooling + Income_composition_of_resources, data = life_data_clean_trn)
model_best_bic_pred = lm (Life_expectancy ~ Year +
                        + Adult_Mortality + Percentage_expenditure + Schooling + BMI + HIV_AIDS+ 
                        + Diphtheria + Income_composition_of_resources, data = life_data_clean_trn)

model_back_aic_pred = step(model_additive_pred, trace = 0)
model_back_bic_pred = step(model_additive_pred, trace = 0, k = log(nrow(life_data_clean_trn)))

model_fwd_aic_pred = step(lm (Life_expectancy ~ 1, data = life_data_clean_trn),
                        scope =Life_expectancy ~ 
                        Year
                        + Status
                        + Income_composition_of_resources
                        + Adult_Mortality
                        + Total_expenditure
                        + Alcohol
                        + Percentage_expenditure
                        + Hepatitis_B
                        + Measles
                        + BMI
                        + Under_5_deaths 
                        + Polio
                        + HIV_AIDS
                        + Diphtheria
                        + Thinness_1_to_19_yrs
                        + Schooling
                        ,
                        direction = "forward", trace = 0)

model_fwd_bic_pred = step(lm (Life_expectancy ~ 1, data = life_data_clean_trn),
                        scope =Life_expectancy ~ 
                        Year
                       + Status
                        + Income_composition_of_resources
                        + Adult_Mortality
                        + Total_expenditure
                        + Alcohol
                        + Percentage_expenditure
                        + Hepatitis_B
                        + Measles
                        + BMI
                        + Under_5_deaths 
                        + Polio
                        + HIV_AIDS
                        + Diphtheria
                        + Thinness_1_to_19_yrs
                        ,
                        direction = "forward", k = log(nrow(life_data_clean_trn)), trace = 0)
model_poly_pred = lm(Life_expectancy~sqrt(Adult_Mortality)*(Income_composition_of_resources^2)*Percentage_expenditure*Schooling*BMI+HIV_AIDS+Diphtheria+Polio+Year*Status,data=life_data_clean_trn)
model_prediction_data = function(model) {
  res = list(
    rmse_test = round(sqrt(mean((life_data_clean_tst$Life_expectancy - predict(model, life_data_clean_tst)) ^ 2)), 4),
    rsme_train = round(sqrt(mean((life_data_clean_trn$Life_expectancy - predict(model, life_data_clean_trn)) ^ 2)), 4)  
  )
}

df_result = rbind( additive = model_prediction_data(model_additive_pred),
                   #transform = model_prediction_data(model_transform_pred),
                   best_aic = model_prediction_data(model_best_aic_pred),
                   best_bic = model_prediction_data(model_best_bic_pred),
                   back_aic = model_prediction_data(model_back_aic_pred),
                   back_bic = model_prediction_data(model_back_bic_pred),
                   fwd_aic = model_prediction_data(model_fwd_aic_pred),
                   fwd_bic = model_prediction_data(model_fwd_bic_pred),
                   poly = model_prediction_data(model_poly_pred)
            )
                     


knitr::kable(df_result)
rmse_test rsme_train
additive 3.5654 3.6408
best_aic 3.5278 3.7212
best_bic 3.5278 3.7212
back_aic 3.5793 3.6424
back_bic 3.6147 3.6642
fwd_aic 3.5793 3.6424
fwd_bic 3.9932 3.8947
poly 4.3949 3.521

The best RMSE for both train and test data sets shows polynomial model

2.4.3 Model Prediction

testing = predict(model_poly_pred,life_data_clean_tst)

#compare actual and predicted values of Life_expectancy
df_pred = head(cbind(actual = life_data_clean_tst$Life_expectancy,predicted = testing))

knitr::kable(df_pred,row.names = FALSE,caption = "Actual Vs Predicted")
Table 2.4: Actual Vs Predicted
actual predicted
58.1 60.2358
57.3 59.4986
57.3 59.2965
54.8 56.8075
75.3 74.8445
75.9 73.9744

As we concluded above, we will use polynomial model going forward. Below is the formula of the model selected.

summary(model_poly)
## 
## Call:
## lm(formula = Life_expectancy ~ sqrt(Adult_Mortality) * (Income_composition_of_resources^2) * 
##     Percentage_expenditure * Schooling * BMI + HIV_AIDS + Diphtheria + 
##     Polio + Year * Status, data = life_data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.078  -2.039   0.039   2.224  12.375 
## 
## Coefficients:
##                                                                                             Estimate
## (Intercept)                                                                                -3.27e+02
## sqrt(Adult_Mortality)                                                                       1.75e+00
## Income_composition_of_resources                                                             4.75e+01
## Percentage_expenditure                                                                      2.17e-01
## Schooling                                                                                   3.89e+00
## BMI                                                                                         7.02e-01
## HIV_AIDS                                                                                   -5.28e-01
## Diphtheria                                                                                  1.64e-02
## Polio                                                                                       5.28e-03
## Year                                                                                        1.72e-01
## StatusDeveloping                                                                            7.01e+02
## sqrt(Adult_Mortality):Income_composition_of_resources                                      -2.92e+00
## sqrt(Adult_Mortality):Percentage_expenditure                                               -1.32e-02
## Income_composition_of_resources:Percentage_expenditure                                     -2.15e-01
## sqrt(Adult_Mortality):Schooling                                                            -1.93e-01
## Income_composition_of_resources:Schooling                                                  -3.00e+00
## Percentage_expenditure:Schooling                                                           -1.56e-02
## sqrt(Adult_Mortality):BMI                                                                  -1.67e-02
## Income_composition_of_resources:BMI                                                        -6.97e-01
## Percentage_expenditure:BMI                                                                 -4.14e-03
## Schooling:BMI                                                                              -5.97e-02
## Year:StatusDeveloping                                                                      -3.49e-01
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure                1.16e-02
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling                             2.54e-01
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling                                      1.05e-03
## Income_composition_of_resources:Percentage_expenditure:Schooling                            1.58e-02
## sqrt(Adult_Mortality):Income_composition_of_resources:BMI                                   3.34e-02
## sqrt(Adult_Mortality):Percentage_expenditure:BMI                                            2.90e-04
## Income_composition_of_resources:Percentage_expenditure:BMI                                  4.15e-03
## sqrt(Adult_Mortality):Schooling:BMI                                                         1.80e-03
## Income_composition_of_resources:Schooling:BMI                                               6.13e-02
## Percentage_expenditure:Schooling:BMI                                                        3.07e-04
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling     -9.84e-04
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:BMI           -2.68e-04
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling:BMI                        -3.01e-03
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling:BMI                                 -2.28e-05
## Income_composition_of_resources:Percentage_expenditure:Schooling:BMI                       -3.14e-04
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling:BMI  2.21e-05
##                                                                                            Std. Error
## (Intercept)                                                                                  1.12e+02
## sqrt(Adult_Mortality)                                                                        5.28e-01
## Income_composition_of_resources                                                              1.42e+01
## Percentage_expenditure                                                                       5.10e-02
## Schooling                                                                                    7.19e-01
## BMI                                                                                          2.67e-01
## HIV_AIDS                                                                                     1.71e-02
## Diphtheria                                                                                   5.31e-03
## Polio                                                                                        5.10e-03
## Year                                                                                         5.56e-02
## StatusDeveloping                                                                             1.22e+02
## sqrt(Adult_Mortality):Income_composition_of_resources                                        1.06e+00
## sqrt(Adult_Mortality):Percentage_expenditure                                                 3.98e-03
## Income_composition_of_resources:Percentage_expenditure                                       6.00e-02
## sqrt(Adult_Mortality):Schooling                                                              5.10e-02
## Income_composition_of_resources:Schooling                                                    1.12e+00
## Percentage_expenditure:Schooling                                                             3.78e-03
## sqrt(Adult_Mortality):BMI                                                                    2.05e-02
## Income_composition_of_resources:BMI                                                          4.29e-01
## Percentage_expenditure:BMI                                                                   1.00e-03
## Schooling:BMI                                                                                2.11e-02
## Year:StatusDeveloping                                                                        6.06e-02
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure                 4.94e-03
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling                              8.80e-02
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling                                       3.17e-04
## Income_composition_of_resources:Percentage_expenditure:Schooling                             4.41e-03
## sqrt(Adult_Mortality):Income_composition_of_resources:BMI                                    3.46e-02
## sqrt(Adult_Mortality):Percentage_expenditure:BMI                                             8.24e-05
## Income_composition_of_resources:Percentage_expenditure:BMI                                   1.12e-03
## sqrt(Adult_Mortality):Schooling:BMI                                                          1.70e-03
## Income_composition_of_resources:Schooling:BMI                                                3.12e-02
## Percentage_expenditure:Schooling:BMI                                                         7.15e-05
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling       3.82e-04
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:BMI             9.35e-05
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling:BMI                          2.65e-03
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling:BMI                                   6.30e-06
## Income_composition_of_resources:Percentage_expenditure:Schooling:BMI                         7.96e-05
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling:BMI   7.04e-06
##                                                                                            t value
## (Intercept)                                                                                  -2.93
## sqrt(Adult_Mortality)                                                                         3.31
## Income_composition_of_resources                                                               3.34
## Percentage_expenditure                                                                        4.25
## Schooling                                                                                     5.41
## BMI                                                                                           2.63
## HIV_AIDS                                                                                    -30.86
## Diphtheria                                                                                    3.08
## Polio                                                                                         1.04
## Year                                                                                          3.10
## StatusDeveloping                                                                              5.76
## sqrt(Adult_Mortality):Income_composition_of_resources                                        -2.74
## sqrt(Adult_Mortality):Percentage_expenditure                                                 -3.32
## Income_composition_of_resources:Percentage_expenditure                                       -3.59
## sqrt(Adult_Mortality):Schooling                                                              -3.79
## Income_composition_of_resources:Schooling                                                    -2.68
## Percentage_expenditure:Schooling                                                             -4.13
## sqrt(Adult_Mortality):BMI                                                                    -0.81
## Income_composition_of_resources:BMI                                                          -1.62
## Percentage_expenditure:BMI                                                                   -4.13
## Schooling:BMI                                                                                -2.83
## Year:StatusDeveloping                                                                        -5.76
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure                  2.35
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling                               2.88
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling                                        3.32
## Income_composition_of_resources:Percentage_expenditure:Schooling                              3.58
## sqrt(Adult_Mortality):Income_composition_of_resources:BMI                                     0.96
## sqrt(Adult_Mortality):Percentage_expenditure:BMI                                              3.52
## Income_composition_of_resources:Percentage_expenditure:BMI                                    3.71
## sqrt(Adult_Mortality):Schooling:BMI                                                           1.06
## Income_composition_of_resources:Schooling:BMI                                                 1.96
## Percentage_expenditure:Schooling:BMI                                                          4.29
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling       -2.58
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:BMI             -2.86
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling:BMI                          -1.14
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling:BMI                                   -3.62
## Income_composition_of_resources:Percentage_expenditure:Schooling:BMI                         -3.94
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling:BMI    3.14
##                                                                                            Pr(>|t|)
## (Intercept)                                                                                 0.00343
## sqrt(Adult_Mortality)                                                                       0.00096
## Income_composition_of_resources                                                             0.00085
## Percentage_expenditure                                                                      2.3e-05
## Schooling                                                                                   7.4e-08
## BMI                                                                                         0.00864
## HIV_AIDS                                                                                    < 2e-16
## Diphtheria                                                                                  0.00210
## Polio                                                                                       0.30028
## Year                                                                                        0.00199
## StatusDeveloping                                                                            1.0e-08
## sqrt(Adult_Mortality):Income_composition_of_resources                                       0.00614
## sqrt(Adult_Mortality):Percentage_expenditure                                                0.00093
## Income_composition_of_resources:Percentage_expenditure                                      0.00035
## sqrt(Adult_Mortality):Schooling                                                             0.00016
## Income_composition_of_resources:Schooling                                                   0.00747
## Percentage_expenditure:Schooling                                                            3.8e-05
## sqrt(Adult_Mortality):BMI                                                                   0.41624
## Income_composition_of_resources:BMI                                                         0.10449
## Percentage_expenditure:BMI                                                                  3.8e-05
## Schooling:BMI                                                                               0.00471
## Year:StatusDeveloping                                                                       9.9e-09
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure                0.01876
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling                             0.00398
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling                                      0.00092
## Income_composition_of_resources:Percentage_expenditure:Schooling                            0.00036
## sqrt(Adult_Mortality):Income_composition_of_resources:BMI                                   0.33487
## sqrt(Adult_Mortality):Percentage_expenditure:BMI                                            0.00044
## Income_composition_of_resources:Percentage_expenditure:BMI                                  0.00022
## sqrt(Adult_Mortality):Schooling:BMI                                                         0.29058
## Income_composition_of_resources:Schooling:BMI                                               0.04960
## Percentage_expenditure:Schooling:BMI                                                        1.9e-05
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling      0.01004
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:BMI            0.00425
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling:BMI                         0.25645
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling:BMI                                  0.00030
## Income_composition_of_resources:Percentage_expenditure:Schooling:BMI                        8.5e-05
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling:BMI  0.00175
##                                                                                               
## (Intercept)                                                                                ** 
## sqrt(Adult_Mortality)                                                                      ***
## Income_composition_of_resources                                                            ***
## Percentage_expenditure                                                                     ***
## Schooling                                                                                  ***
## BMI                                                                                        ** 
## HIV_AIDS                                                                                   ***
## Diphtheria                                                                                 ** 
## Polio                                                                                         
## Year                                                                                       ** 
## StatusDeveloping                                                                           ***
## sqrt(Adult_Mortality):Income_composition_of_resources                                      ** 
## sqrt(Adult_Mortality):Percentage_expenditure                                               ***
## Income_composition_of_resources:Percentage_expenditure                                     ***
## sqrt(Adult_Mortality):Schooling                                                            ***
## Income_composition_of_resources:Schooling                                                  ** 
## Percentage_expenditure:Schooling                                                           ***
## sqrt(Adult_Mortality):BMI                                                                     
## Income_composition_of_resources:BMI                                                           
## Percentage_expenditure:BMI                                                                 ***
## Schooling:BMI                                                                              ** 
## Year:StatusDeveloping                                                                      ***
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure               *  
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling                            ** 
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling                                     ***
## Income_composition_of_resources:Percentage_expenditure:Schooling                           ***
## sqrt(Adult_Mortality):Income_composition_of_resources:BMI                                     
## sqrt(Adult_Mortality):Percentage_expenditure:BMI                                           ***
## Income_composition_of_resources:Percentage_expenditure:BMI                                 ***
## sqrt(Adult_Mortality):Schooling:BMI                                                           
## Income_composition_of_resources:Schooling:BMI                                              *  
## Percentage_expenditure:Schooling:BMI                                                       ***
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling     *  
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:BMI           ** 
## sqrt(Adult_Mortality):Income_composition_of_resources:Schooling:BMI                           
## sqrt(Adult_Mortality):Percentage_expenditure:Schooling:BMI                                 ***
## Income_composition_of_resources:Percentage_expenditure:Schooling:BMI                       ***
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling:BMI ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.57 on 1611 degrees of freedom
## Multiple R-squared:  0.839,  Adjusted R-squared:  0.835 
## F-statistic:  227 on 37 and 1611 DF,  p-value: <2e-16

2.4.4 Individual parameter significance test for selected model

Looking at the diagnostics, we will check can our model still do better. We will now look at the individual significant of the parameters of this model to see if we can eliminate any predictors

names(coef(model_poly))
##  [1] "(Intercept)"                                                                               
##  [2] "sqrt(Adult_Mortality)"                                                                     
##  [3] "Income_composition_of_resources"                                                           
##  [4] "Percentage_expenditure"                                                                    
##  [5] "Schooling"                                                                                 
##  [6] "BMI"                                                                                       
##  [7] "HIV_AIDS"                                                                                  
##  [8] "Diphtheria"                                                                                
##  [9] "Polio"                                                                                     
## [10] "Year"                                                                                      
## [11] "StatusDeveloping"                                                                          
## [12] "sqrt(Adult_Mortality):Income_composition_of_resources"                                     
## [13] "sqrt(Adult_Mortality):Percentage_expenditure"                                              
## [14] "Income_composition_of_resources:Percentage_expenditure"                                    
## [15] "sqrt(Adult_Mortality):Schooling"                                                           
## [16] "Income_composition_of_resources:Schooling"                                                 
## [17] "Percentage_expenditure:Schooling"                                                          
## [18] "sqrt(Adult_Mortality):BMI"                                                                 
## [19] "Income_composition_of_resources:BMI"                                                       
## [20] "Percentage_expenditure:BMI"                                                                
## [21] "Schooling:BMI"                                                                             
## [22] "Year:StatusDeveloping"                                                                     
## [23] "sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure"              
## [24] "sqrt(Adult_Mortality):Income_composition_of_resources:Schooling"                           
## [25] "sqrt(Adult_Mortality):Percentage_expenditure:Schooling"                                    
## [26] "Income_composition_of_resources:Percentage_expenditure:Schooling"                          
## [27] "sqrt(Adult_Mortality):Income_composition_of_resources:BMI"                                 
## [28] "sqrt(Adult_Mortality):Percentage_expenditure:BMI"                                          
## [29] "Income_composition_of_resources:Percentage_expenditure:BMI"                                
## [30] "sqrt(Adult_Mortality):Schooling:BMI"                                                       
## [31] "Income_composition_of_resources:Schooling:BMI"                                             
## [32] "Percentage_expenditure:Schooling:BMI"                                                      
## [33] "sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling"    
## [34] "sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:BMI"          
## [35] "sqrt(Adult_Mortality):Income_composition_of_resources:Schooling:BMI"                       
## [36] "sqrt(Adult_Mortality):Percentage_expenditure:Schooling:BMI"                                
## [37] "Income_composition_of_resources:Percentage_expenditure:Schooling:BMI"                      
## [38] "sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling:BMI"

The above are all the coefficients of the model. We will use them to compare to the below filtered list of p-values > .1

a <- coef(summary(model_poly))[,"Pr(>|t|)"] 
names(a[a>.01])
## [1] "Polio"                                                                                 
## [2] "sqrt(Adult_Mortality):BMI"                                                             
## [3] "Income_composition_of_resources:BMI"                                                   
## [4] "sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure"          
## [5] "sqrt(Adult_Mortality):Income_composition_of_resources:BMI"                             
## [6] "sqrt(Adult_Mortality):Schooling:BMI"                                                   
## [7] "Income_composition_of_resources:Schooling:BMI"                                         
## [8] "sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling"
## [9] "sqrt(Adult_Mortality):Income_composition_of_resources:Schooling:BMI"

All parameters are significant

2.4.5 Variance Inflation factor identification

We look at variance inflation factors, and filter by only vifs that are >5

car::vif(model_poly)
##                                                                      sqrt(Adult_Mortality) 
##                                                                                8.91090e+02 
##                                                            Income_composition_of_resources 
##                                                                                8.74100e+02 
##                                                                     Percentage_expenditure 
##                                                                                1.04136e+06 
##                                                                                  Schooling 
##                                                                                5.21921e+02 
##                                                                                        BMI 
##                                                                                3.59141e+03 
##                                                                                   HIV_AIDS 
##                                                                                1.37393e+00 
##                                                                                 Diphtheria 
##                                                                                1.69488e+00 
##                                                                                      Polio 
##                                                                                1.69129e+00 
##                                                                                       Year 
##                                                                                6.66083e+00 
##                                                                                     Status 
##                                                                                2.39484e+05 
##                                      sqrt(Adult_Mortality):Income_composition_of_resources 
##                                                                                1.24285e+03 
##                                               sqrt(Adult_Mortality):Percentage_expenditure 
##                                                                                4.03822e+05 
##                                     Income_composition_of_resources:Percentage_expenditure 
##                                                                                1.14009e+06 
##                                                            sqrt(Adult_Mortality):Schooling 
##                                                                                9.94743e+02 
##                                                  Income_composition_of_resources:Schooling 
##                                                                                2.24803e+03 
##                                                           Percentage_expenditure:Schooling 
##                                                                                1.45994e+06 
##                                                                  sqrt(Adult_Mortality):BMI 
##                                                                                3.21407e+03 
##                                                        Income_composition_of_resources:BMI 
##                                                                                7.19492e+03 
##                                                                 Percentage_expenditure:BMI 
##                                                                                1.26814e+06 
##                                                                              Schooling:BMI 
##                                                                                5.84200e+03 
##                                                                                Year:Status 
##                                                                                2.39630e+05 
##               sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure 
##                                                                                4.86486e+05 
##                            sqrt(Adult_Mortality):Income_composition_of_resources:Schooling 
##                                                                                1.85237e+03 
##                                     sqrt(Adult_Mortality):Percentage_expenditure:Schooling 
##                                                                                6.41285e+05 
##                           Income_composition_of_resources:Percentage_expenditure:Schooling 
##                                                                                1.58696e+06 
##                                  sqrt(Adult_Mortality):Income_composition_of_resources:BMI 
##                                                                                5.47005e+03 
##                                           sqrt(Adult_Mortality):Percentage_expenditure:BMI 
##                                                                                5.33188e+05 
##                                 Income_composition_of_resources:Percentage_expenditure:BMI 
##                                                                                1.25602e+06 
##                                                        sqrt(Adult_Mortality):Schooling:BMI 
##                                                                                4.59203e+03 
##                                              Income_composition_of_resources:Schooling:BMI 
##                                                                                9.84693e+03 
##                                                       Percentage_expenditure:Schooling:BMI 
##                                                                                1.72039e+06 
##     sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling 
##                                                                                7.33117e+05 
##           sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:BMI 
##                                                                                5.36506e+05 
##                        sqrt(Adult_Mortality):Income_composition_of_resources:Schooling:BMI 
##                                                                                7.35438e+03 
##                                 sqrt(Adult_Mortality):Percentage_expenditure:Schooling:BMI 
##                                                                                8.10010e+05 
##                       Income_composition_of_resources:Percentage_expenditure:Schooling:BMI 
##                                                                                1.70870e+06 
## sqrt(Adult_Mortality):Income_composition_of_resources:Percentage_expenditure:Schooling:BMI 
##                                                                                8.01228e+05

We notice that there are no high vif values so we make no changes

2.4.6 Influential points identification and handling

We will now look at high influence points and investigate them

As an experiment we try and remove the influentials and see what impact this has on the diagnostics

sum(cooks.distance(model_poly) > 4 / length(cooks.distance(model_poly)))
## [1] 101
inf_points_removed = subset(life_data_clean, cooks.distance(model_poly) <= 4 / length(cooks.distance(model_poly)))


model_poly_inf_removed = lm(Life_expectancy~sqrt(Adult_Mortality)*(Income_composition_of_resources^2)*Percentage_expenditure*Schooling*BMI+HIV_AIDS+Diphtheria+Polio+Year*Status,data=inf_points_removed)

2.4.6.1 Diagnostics Comparison

Now we compare the diagnostics data

model_with_influential = diagnostics(model_poly)

model_without_influential = diagnostics(model_poly_inf_removed)

df_result = rbind(model_with_influential=model_with_influential, model_without_influential=model_without_influential)

knitr::kable(df_result, align = "cccccccc", caption = "Model Comparison with and without Influential datapoints")
Table 2.5: Model Comparison with and without Influential datapoints
num_predictors sw_pvalue sw_decision bp_pvalue bp_decision RMSE AdjustedR2 AIC
model_with_influential 37 6.78173115056804e-08 Reject 6.79157135486358e-17 Reject 3.5305 0.83513 4236.26
model_without_influential 37 0.0000549230744637251 Reject 0.0107460089192416 Fail to Reject. 2.9627 0.87119 3438.56

We see that our diagnostics have improved significantly, including BP test and saphiro test p-values, lower RMSE and higher adj R2.

We will have to sacrifice about 9% of the observations but the improvements in RMSE and Adjusted Rsquare are significant.

So now we know that it is the influential points that are causing our model to have less than ideal diagnostics and hence we will discard the influential points and select the new model as our better model

2.5 Selected model

Hence now our good model is model_without_influential and the dataset is inf_points_removed

3 Results

3.1 Comparison of all models

We have already seen this in various places but we will now compare the diagnostics of all the models that we have seen to see how we have progressed

Table 3.1: Model Comparison Summary
num_predictors sw_pvalue sw_decision bp_pvalue bp_decision RMSE AdjustedR2 AIC
m_additive 17 5.80987998863635e-08 Reject 6.41579265092972e-33 Reject 3.6097 0.82976 4269.41
m_transform_1 12 5.58980035080847e-15 Reject 3.40055212895347e-86 Reject 4.8562 0.71269 21269.62
m_transform_1_step 12 5.58980035080847e-15 Reject 3.40055212895347e-86 Reject 4.8573 0.71269 21269.62
model_min_aic 8 6.15431590889256e-10 Reject 3.4663551604506e-29 Reject 3.6678 0.8252 4304.07
model_min_bic 8 6.15431590889256e-10 Reject 3.46635516045084e-29 Reject 3.6678 0.8252 4304.07
model_back_aic 14 4.22813740389515e-08 Reject 1.3205694135825e-34 Reject 3.6138 0.82969 4267.14
model_back_bic 10 2.23394873540412e-08 Reject 4.56792467310997e-35 Reject 3.6319 0.8284 4275.64
model_fwd_aic 13 2.64491669882093e-08 Reject 1.37799693326893e-34 Reject 3.617 0.82949 4268.02
model_fwd_bic 10 5.30301565996376e-13 Reject 1.60142326938387e-81 Reject 3.8867 0.80348 4499.21
model_poly 37 6.78173115056804e-08 Reject 6.79157135486358e-17 Reject 3.5305 0.83513 4236.26
model_without_influential 37 0.0000549230744637251 Reject 0.0107460089192416 Fail to Reject. 2.9627 0.87119 3438.56

3.2 Diagnostic plots of Selected Model

We also look at the diagnostics plot of our selected model

Our diagnostics plots look fairly good. Using an \(\alpha = .01\), even BP test validate the equal variance assumptions. Hence we will select model_ploy as our final model.

4 Discussion

We spent significant time inspecting, cleaning the data and identifying the best possible model. Majorly time was spent in trying various predictors, their combinations and possible transformations. We looked at variance inflation factors and removed variables with high vifs. Visual pair plots, collinearity and partial correlation coefficient mechanism was applied repetitively to identify collinearity in regression models.

We also used boxcox to identify possible response and predictor transformations needed.

Multiple model selection methodology is applied including but not limited to simple additive, step-wise forward and backward using AIC & BIC, interactive, polynomial and their combination.

We spent a significant time looking at the individual significance test and experimenting with the variables there. This analysis had a limited benefit, and resulted in only making our diagnostic parameters worse. We still picked the large interactive polynomial model satisfying all major criteria such as train and test RMSE, adjusted \(R^2\) and passing bp test. We also run anova test on top three models

We then investigated the result of removing influential observations, this had some improvement on the diagnostics, and it seemed like a good compromise

After looking at the diagnostics plot we come to the final conclusion that we have a good enough model for identifying the key predictors for explaining the model and predicting with reasonable accuracy.

As we can see from the diagnostic statistics above, the model is useful since it has a fairly low RMSE with a very high adjusted \(R^2\) and AIC is also very low.

Hence we can conclude that we have a fairly good model.

Final Note: Contrary to our initial assumption, Life expectancy data analysis and designing a useful model was not straight-forward and was a challenging task. After trying several different combinations our team came up with a complex model. The combination of predictor interaction and transformation makes our model slightly complex but variable selection and their interpretation looks relevant for Life Expectancy. So we conclude that the model we have selected is good for prediction.

5 Appendix

5.1 Team members

Table 5.1: Team Members
Name NetID Email
Boris Tsekinovsky borist3
Sid Rathi rathi9
Suraj Bisht surajb2

5.2 Additional Diagnostic plots of Selected Model

#histograms
hist(inf_points_removed$BMI,main="Distribution of BMI",xlab = "BMI",ylab="Count", col = "blue",ylim=c(0,300))

hist(inf_points_removed$Hepatitis_B, main="Distribution of Hepatitis B",xlab = "Hepatitis B",ylab="Count", col = "blue")

hist(inf_points_removed$Measles,main="Distribution of Measles",xlab = "Measles",ylab="Count", col = "blue")

hist(inf_points_removed$Polio,main="Distribution of Polio",xlab = "Polio",ylab="Count", col = "blue")

hist(inf_points_removed$Diphtheria,main="Distribution of Diphtheria",xlab = "Diphtheria",ylab="Count", col = "blue")

hist(inf_points_removed$Adult_Mortality,main="Distribution of Adult Mortality",xlab = "Adult Mortality",ylab="Count", col = "blue")

hist(inf_points_removed$Alcohol,main="Distribution of Alcohol",xlab = "Alcohol",ylab="Count", col = "blue")

hist(inf_points_removed$HIV_AIDS,main="Distribution of HIV AIDS",xlab = "HIV AIDS",ylab="Count", col = "blue")

hist(inf_points_removed$Percentage_expenditure,main="Distribution of Percentage Expenditure",xlab = "Percentage Expenditure",ylab="Count", col = "blue")

hist(inf_points_removed$Life_expectancy,main="Distribution of Life Expectancy",xlab = "Life Expectancy", ylab="Count", col = "blue")

hist(inf_points_removed$Under_5_deaths,main="Distribution of Under Five Deaths ",xlab = "Under Five Deaths",ylab="Count", col = "blue",ylim=c(0,2000))

hist(inf_points_removed$Thinness_1_to_19_yrs,main="Distribution of thinness 1-19",xlab = "Thinness 1-19",ylab="Count", col = "blue",ylim=c(0,300))

```